IDA_General tasks_33.ipynb

  1. a) How is the logistics delay distributed? Justify your choice with statistical tests and briefly describe your approach.
  2. Data Visualization
    1. To get a better understanding of the distribution, we will start with a visualization of the data:
  3. Statistical Tests
    1. After visualizing the data, we perform different statistical tests to determine the type of distribution
  4. Step 1: Import Packages
  5. Step 2: Data Preparation
    1. Merge Datasets
    2. Handle Data
  6. Step 3: Visualize Data
  7. Step 4: Define Features and Target Variable
    1. Target Variable:
    2. Features:
  8. Step 5: Splitting the data
  9. Step 6: Building the decision tree
  10. Step 7: Evaluate the model
  11. Step 8: Visualize the decision tree
  12. Step 9: Final model interpretation
    1. Data security:
    2. Enhanced performance
    3. Data Integrity
    4. Scalability and ease of integration.
    5. Easier maintenance and debugging
    6. The name of such a database structure is: Relational Database Structure
#### Load all necessary Libraries

[1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.cluster import KMeans
from scipy import stats

import warnings
##### Function to read txt data files and convert them to proper csv files:
  1. txtFile: input filename (including directory if applicable)
  2. csvFile: output filename (including directory if applicable)
  3. vtabchar: vertical tab character in the original file (to be replaced with the newline character '\n')
  4. delim: delimiter character used in the original file (to be replaced with a comma)
[2]:

def txt2csv(txtFile, csvFile, vtabchar, delim):
    with open(txtFile, 'r') as file:
        data = file.read().replace(vtabchar, '\n').replace(delim, ',')
    with open(csvFile, 'w') as file:
        file.write(data)
    return
#### Define the file paths

[3]:

txt_file_path = "Einzelteil_T16.txt"
csv_file_path = "Einzelteil_T16.csv"

[4]:
txt2csv(txt_file_path, csv_file_path, ' ', ' | | ')
##### Load all relevant Files
[5]:
Einzelteil = pd.read_csv(csv_file_path, low_memory=False )
[6]:
Einzelteil.head()
[6]:
X1 ID_T16.x Produktionsdatum.x Herstellernummer.x Werksnummer.x Fehlerhaft.x Fehlerhaft_Datum.x Fehlerhaft_Fahrleistung.x ID_T16.y Produktionsdatum.y ... Fehlerhaft.y Fehlerhaft_Datum.y Fehlerhaft_Fahrleistung.y ID_T16 Produktionsdatum Herstellernummer Werksnummer Fehlerhaft Fehlerhaft_Datum Fehlerhaft_Fahrleistung
1 1 16-212-2121-7 2008-11-07 212.0 2121.0 0.0 NaN 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 16-212-2122-41 2008-11-08 212.0 2122.0 0.0 NaN 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 5 16-212-2121-36 2008-11-07 212.0 2121.0 0.0 NaN 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 10 16-212-2122-20 2008-11-07 212.0 2122.0 0.0 NaN 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 12 16-212-2122-33 2008-11-07 212.0 2122.0 0.0 NaN 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 22 columns

[7]:

Fahrzeuge_df = pd.read_csv("Fahrzeuge_OEM1_Typ11_Fehleranalyse.csv")

[8]:
Fahrzeuge_df.head()
[8]:
Unnamed: 0 X X1 ID_Fahrzeug Herstellernummer Werksnummer Fehlerhaft_Datum Fehlerhaft_Fahrleistung days fuel engine
0 9 9 9 11-1-11-9 1 11 2010-03-16 34824.319559 1493.150761 4.003670 small
1 11 11 11 11-1-11-11 1 11 2010-03-16 74217.428309 1044.462231 11.042487 large
2 13 13 13 11-1-11-13 1 11 2010-03-16 32230.699639 749.669810 3.579117 small
3 15 15 15 11-1-11-15 1 11 2010-03-16 44885.783551 858.688003 4.666801 small
4 37 37 37 11-1-11-37 1 11 2010-03-17 86348.329866 1478.204174 4.634381 small
[9]:
# load all necessary files

## load dataset "Komponente_K7.csv" as komponente_k7_df
komponente_k7_df = pd.read_csv("Komponente_K7.csv", delimiter=';')
## load dataset "Logistikverzug_K7.csv" as logistikverzug_df
logistikverzug_df = pd.read_csv("Logistikverzug_K7.csv", delimiter=',')
[10]:
logistikverzug_df = pd.read_csv(r'C:\Users\A R K A\Documents\IDA\Logistikverzug_K7.csv')

[11]:
logistikverzug_df.head()
[11]:
Unnamed: 0 IDNummer Wareneingang Herstellernummer Werksnummer Fehlerhaft
0 1 K7-113-1132-153160 2016-11-22 112 1132 0
1 2 K7-113-1132-153109 2016-11-20 112 1132 0
2 3 K7-113-1132-153195 2016-11-20 112 1132 0
3 4 K7-113-1132-153226 2016-11-20 112 1132 0
4 5 K7-113-1132-153231 2016-11-20 112 1132 0
[12]:
# Read file with skipping bad lines
zulassungen_df = pd.read_csv("Zulassungen_alle_Fahrzeuge.csv", on_bad_lines='skip')

# Display the first few rows of the dataframe to ensure it loaded correctly
zulassungen_df.head()
[12]:
Unnamed: 0 IDNummer Gemeinden Zulassung
0 408097 11-1-11-1 DRESDEN 01-01-2009
1 408098 11-1-11-2 DRESDEN 01-01-2009
2 1 12-1-12-1 LEIPZIG 01-01-2009
3 2 12-1-12-2 LEIPZIG 01-01-2009
4 3 12-1-12-3 DORTMUND 01-01-2009
[13]:
Bestandteile_Fahrzeuge_OEM1_Typ12_df = pd.read_csv('Bestandteile_Fahrzeuge_OEM1_Typ12.csv',sep=';')
[14]:
Bestandteile_Fahrzeuge_OEM1_Typ12_df.head()
[14]:
Unnamed: 0 ID_Karosserie ID_Schaltung ID_Sitze ID_Motor ID_Fahrzeug
0 1 K5-112-1122-1 K3SG1-105-1051-4 K2ST1-109-1092-1 K1BE1-101-1011-90 12-1-12-1
1 2 K5-112-1122-11 K3AG1-105-1051-9 K2ST1-109-1092-16 K1BE1-101-1011-2 12-1-12-2
2 3 K5-112-1122-2 K3AG1-105-1051-23 K2ST1-109-1092-21 K1BE1-101-1011-8 12-1-12-3
3 4 K5-112-1122-3 K3AG1-106-1061-66 K2ST1-109-1092-70 K1BE1-101-1011-11 12-1-12-4
4 5 K5-112-1122-4 K3SG1-105-1051-8 K2ST1-109-1092-78 K1BE1-101-1011-42 12-1-12-5
[15]:
komponente_k7_df = pd.read_csv("Komponente_K7.csv" ,sep =';')

komponente_k7_df.head()
[15]:
Unnamed: 0 IDNummer Produktionsdatum Herstellernummer Werksnummer Fehlerhaft
0 1 K7-114-1142-1 2008-11-12 114 1142 0
1 2 K7-114-1142-2 2008-11-12 114 1142 0
2 3 K7-114-1142-3 2008-11-13 114 1142 0
3 4 K7-114-1142-4 2008-11-13 114 1142 0
4 5 K7-114-1142-5 2008-11-13 114 1142 0
[16]:
# Define the columns to read and their new names for Komponente
columns_to_read = {'ID_T16': 'ID_T16', 'ID_K2ST2': 'ID_Komponente'} # Original to new name mapping
columns_to_read_2 = {'ID_T16': 'ID_T16', 'ID_K2LE2': 'ID_Komponente'} # Original to new name mapping
columns_to_use = list(columns_to_read.keys()) # Specify which columns to read
columns_to_use_2 = list(columns_to_read_2.keys()) # Specify which columns to read
[17]:
# load Komponente files
Komponente_K2LE2_df = pd.read_csv("Bestandteile_Komponente_K2LE2.csv",usecols=columns_to_use_2,sep=';')
Komponente_K2ST2_df = pd.read_csv("Bestandteile_Komponente_K2ST2.csv",usecols=columns_to_use,sep=';')
[18]:
# Rename the columns using the mapping
Komponente_K2LE2_df.rename(columns=columns_to_read_2, inplace=True)
Komponente_K2ST2_df.rename(columns=columns_to_read, inplace=True)

[19]:
Komponente_K2LE2_df.head()
[19]:
ID_T16 ID_Komponente
0 16-213-2132-44 K2LE2-111-1111-1
1 16-215-2152-68 K2LE2-111-1111-2
2 16-212-2121-9 K2LE2-111-1111-3
3 16-212-2121-16 K2LE2-111-1111-4
4 16-212-2121-19 K2LE2-111-1111-5
[20]:
#Merge component Files
Komponente_df = pd.concat([Komponente_K2LE2_df, Komponente_K2ST2_df], ignore_index=True)
[21]:
Komponente_df.head()
[21]:
ID_T16 ID_Komponente
0 16-213-2132-44 K2LE2-111-1111-1
1 16-215-2152-68 K2LE2-111-1111-2
2 16-212-2121-9 K2LE2-111-1111-3
3 16-212-2121-16 K2LE2-111-1111-4
4 16-212-2121-19 K2LE2-111-1111-5
[22]:
# Define the columns to read and their new names for Fahrzeuge
to_read = {'ID_Sitze': 'ID_Komponente', 'ID_Fahrzeug': 'IDNummer'} # Original to new name mapping
to_use = list(to_read.keys()) # Specify which columns to read

[23]:
#load Fahrzeuge Files
Fahrzeuge_OEM2_Typ21_df = pd.read_csv(r'C:\Users\A R K A\Documents\IDA\Bestandteile_Fahrzeuge_OEM2_Typ21.csv',usecols=to_use,sep=';')
Fahrzeuge_OEM2_Typ22_df = pd.read_csv(r'C:\Users\A R K A\Documents\IDA\Bestandteile_Fahrzeuge_OEM2_Typ22.csv',usecols=to_use,sep=';')
[24]:
#Merge Fahrzeuge Files
Fahrzeuge_Merged_df = pd.concat([Fahrzeuge_OEM2_Typ21_df, Fahrzeuge_OEM2_Typ22_df], ignore_index=True)
[25]:
# Rename the columns using the mapping
Fahrzeuge_Merged_df.rename(columns=to_read, inplace=True)
Fahrzeuge_Merged_df

### 1. Logistics and Product Development in the Automobile Industry

[26]:
##### "Use the production date (“Produktionsdatum”) from the dataset “Komponente_K7.csv” and the receiving date of incoming goods (“Wareneingang”) from “Logistikverzug_K7.csv” (logistics delay). Assume that produced goods are issued one day after the production date. For the model design in Python, create a new dataset “Logistics delay” containing the required information from both datasets."

[27]:
komponente_k7_df.head()
[27]:
Unnamed: 0 IDNummer Produktionsdatum Herstellernummer Werksnummer Fehlerhaft
0 1 K7-114-1142-1 2008-11-12 114 1142 0
1 2 K7-114-1142-2 2008-11-12 114 1142 0
2 3 K7-114-1142-3 2008-11-13 114 1142 0
3 4 K7-114-1142-4 2008-11-13 114 1142 0
4 5 K7-114-1142-5 2008-11-13 114 1142 0
[28]:
logistikverzug_df.head()
[28]:
Unnamed: 0 IDNummer Wareneingang Herstellernummer Werksnummer Fehlerhaft
0 1 K7-113-1132-153160 2016-11-22 112 1132 0
1 2 K7-113-1132-153109 2016-11-20 112 1132 0
2 3 K7-113-1132-153195 2016-11-20 112 1132 0
3 4 K7-113-1132-153226 2016-11-20 112 1132 0
4 5 K7-113-1132-153231 2016-11-20 112 1132 0
[30]:
# Drop the "Unnamed: 0" column from both logistikverzug_df and komponente_k7_df
logistikverzug_df.drop('Unnamed: 0', axis=1, inplace=True)
komponente_k7_df.drop('Unnamed: 0', axis=1, inplace=True)
[31]:
komponente_k7_df.head()
[31]:
IDNummer Produktionsdatum Herstellernummer Werksnummer Fehlerhaft
0 K7-114-1142-1 2008-11-12 114 1142 0
1 K7-114-1142-2 2008-11-12 114 1142 0
2 K7-114-1142-3 2008-11-13 114 1142 0
3 K7-114-1142-4 2008-11-13 114 1142 0
4 K7-114-1142-5 2008-11-13 114 1142 0
[32]:
#convert date columns to datetime
komponente_k7_df['Produktionsdatum'] = pd.to_datetime(komponente_k7_df['Produktionsdatum'])
logistikverzug_df['Wareneingang'] = pd.to_datetime(logistikverzug_df['Wareneingang'])

#assume goods are issued one day after the production date
komponente_k7_df['Issued_Products'] = komponente_k7_df['Produktionsdatum'] + pd.Timedelta(days=1)

# merge the datasets on IDNummer
merged_df = pd.merge(komponente_k7_df, logistikverzug_df, on='IDNummer')

# calculate the logistics delay, measured from the production date:
# delay = Wareneingang - Produktionsdatum, which equals
# (Wareneingang - Issued_Products) + 1 day, since goods are issued one day after production

merged_df['LogisticsDelay'] = (merged_df['Wareneingang'] - merged_df['Issued_Products']).dt.days + 1

#create a new dataframe with all the important information
logistics_delay_df = merged_df[['IDNummer', 'Produktionsdatum', 'Issued_Products', 'Wareneingang', 'LogisticsDelay']]

#save the new dataframe
logistics_delay_df
[32]:
IDNummer Produktionsdatum Issued_Products Wareneingang LogisticsDelay
0 K7-114-1142-1 2008-11-12 2008-11-13 2008-11-19 7
1 K7-114-1142-2 2008-11-12 2008-11-13 2008-11-19 7
2 K7-114-1142-3 2008-11-13 2008-11-14 2008-11-20 7
3 K7-114-1142-4 2008-11-13 2008-11-14 2008-11-20 7
4 K7-114-1142-5 2008-11-13 2008-11-14 2008-11-19 6
... ... ... ... ... ...
306485 K7-113-1132-153241 2016-11-12 2016-11-13 2016-11-19 7
306486 K7-113-1132-153242 2016-11-12 2016-11-13 2016-11-19 7
306487 K7-113-1132-153243 2016-11-12 2016-11-13 2016-11-20 8
306488 K7-113-1132-153244 2016-11-12 2016-11-13 2016-11-18 6
306489 K7-113-1132-153245 2016-11-13 2016-11-14 2016-11-20 7

306490 rows × 5 columns

[33]:
# taking a sample of the data
# we create a sample to use for analysis, statistical testing and visualizations:
# the sample is easier to visualize and reduces computational time

sample_size = 0.01
logistics_delay_sample = logistics_delay_df.sample(frac=sample_size, random_state=1).copy()
### a) How is the logistics delay distributed? Justify your choice with statistical tests and briefly describe your approach.

### Data Visualization
#### To get a better understanding of the distribution, we will start with a visualization of the data:


[34]:
# creating a histogram and density plot of the sample set
plt.figure(figsize=(7, 4))
sns.histplot(logistics_delay_sample['LogisticsDelay'], kde=True, bins=20)
plt.title('Distribution of Logistics Delay')
plt.xlabel('Logistics Delay (days)')
plt.ylabel('Frequency')
plt.show()
##### Interpretation: The histogram and density plot of the sample show multiple peaks, suggesting that the data do not follow a normal distribution.
### Statistical Tests
#### After visualizing the data, we perform different statistical tests to determine the type of distribution

[35]:
# Shapiro-Wilk Test for Normality

stat, p_value = stats.shapiro(logistics_delay_sample['LogisticsDelay'])
print(f'Shapiro-Wilk Test: Statistics={stat}, p-value={p_value}')

# Interpretation of Shapiro-Wilk Test
# H0: The logistics delay follows a normal distribution
# if p value > 0.05, we fail to reject, the data follow a normal distribution
# if p value < 0.05, we reject, the data do not follow a normal distribution


if p_value > 0.05:
    print("The logistics delay follows a normal distribution (fail to reject H0).")
else:
    print("The logistics delay does not follow a normal distribution (reject H0).")
Shapiro-Wilk Test: Statistics=0.8907687664031982, p-value=3.433181237595802e-42
The logistics delay does not follow a normal distribution (reject H0).
##### Interpretation: Both the visual representation and the Shapiro-Wilk test indicate that the logistics delay does not follow a normal distribution. In this case further testing should be performed.
[36]:
### Kolmogorov-Smirnov Test for Exponential Distribution

# fit the exponential parameters (loc, scale) to the sample rather than
# passing (mean, std), which would wrongly shift the distribution's support
expon_params = stats.expon.fit(logistics_delay_sample['LogisticsDelay'])
d_stat, d_p_value = stats.kstest(logistics_delay_sample['LogisticsDelay'], 'expon', args=expon_params)
print(f'Kolmogorov-Smirnov Test: Statistics={d_stat}, p-value={d_p_value}')

# Interpretation of Kolmogorov-Smirnov Test
# H0: The logistics delay follows an exponential distribution
# if p value > 0.05, we fail to reject, the data follow an exponential distribution
# if p value < 0.05, we reject, the data do not follow an exponential distribution

if d_p_value > 0.05:
    print("The logistics delay follows an exponential distribution (fail to reject H0).")
else:
    print("The logistics delay does not follow an exponential distribution (reject H0).")
Kolmogorov-Smirnov Test: Statistics=0.7158238172920065, p-value=0.0
The logistics delay does not follow an exponential distribution (reject H0).
##### Interpretation: The Kolmogorov-Smirnov test indicates that the logistics delay does not follow an exponential distribution.
[37]:
### Kolmogorov-Smirnov Test for Gamma Distribution
gamma_params = stats.gamma.fit(logistics_delay_sample['LogisticsDelay'])
d_stat_gamma, d_p_value_gamma = stats.kstest(logistics_delay_sample['LogisticsDelay'], 'gamma', args=gamma_params)
print(f'Kolmogorov-Smirnov Test for Gamma Distribution: Statistics={d_stat_gamma}, p-value={d_p_value_gamma}')


# Interpretation of Kolmogorov-Smirnov Test
# H0: The logistics delay follows a gamma distribution
# if p value > 0.05, we fail to reject, the data follow a gamma distribution
# if p value < 0.05, we reject, the data do not follow a gamma distribution


if d_p_value_gamma > 0.05:
    print("The logistics delay follows a gamma distribution (fail to reject H0).")
else:
    print("The logistics delay does not follow a gamma distribution (reject H0).")
Kolmogorov-Smirnov Test for Gamma Distribution: Statistics=0.21600036314061488, p-value=5.208266151703579e-126
The logistics delay does not follow a gamma distribution (reject H0).
##### Interpretation: The Kolmogorov-Smirnov test indicates that the logistics delay does not follow a gamma distribution.
[38]:
### Kolmogorov-Smirnov Test for Log-Normal Distribution
lognorm_params = stats.lognorm.fit(logistics_delay_sample['LogisticsDelay'])
d_stat_lognorm, d_p_value_lognorm = stats.kstest(logistics_delay_sample['LogisticsDelay'], 'lognorm', args=lognorm_params)
print(f'Kolmogorov-Smirnov Test for Log-Normal Distribution: Statistics={d_stat_lognorm}, p-value={d_p_value_lognorm}')


# Interpretation of Kolmogorov-Smirnov Test
# H0: The logistics delay follows a Log-Normal distribution
# if p value > 0.05, we fail to reject, the data follow a Log-Normal distribution
# if p value < 0.05, we reject, the data do not follow a Log-Normal distribution

if d_p_value_lognorm > 0.05:
    print("The logistics delay follows a log-normal distribution (fail to reject H0).")
else:
    print("The logistics delay does not follow a log-normal distribution (reject H0).")
Kolmogorov-Smirnov Test for Log-Normal Distribution: Statistics=0.2179015832535024, p-value=2.9672178452961654e-128
The logistics delay does not follow a log-normal distribution (reject H0).
##### Interpretation: The Kolmogorov-Smirnov test indicates that the logistics delay does not follow a log-normal distribution.
#### The results of our statistical tests indicate that the data do not follow a normal, exponential, gamma or log-normal distribution. The logistics delay data seem to have a complex distribution that is not captured by these parametric distributions.

#### To get more insight into the data, we can use Kernel Density Estimation (KDE) to model the distribution without assuming any specific parametric form.


[39]:
plt.figure(figsize=(7, 4))
sns.kdeplot(logistics_delay_sample['LogisticsDelay'], fill=True)
plt.title('Kernel Density Estimation of Logistics Delay (Sample)')
plt.xlabel('Logistics Delay (days)')
plt.ylabel('Density')
plt.show()
#### The KDE plot shows multiple peaks, reinforcing the indication that the logistics delay is multimodal. Specifically, we notice peaks around 6, 7, 8 and 9 days, which suggests that there may be distinct groups (clusters) within the data.

#### To get a better understanding, we can use unsupervised learning to identify patterns and structures within the data. More specifically: cluster analysis.

[40]:
# Cluster Analysis

# reshape data for clustering
data_for_clustering = logistics_delay_sample['LogisticsDelay'].values.reshape(-1, 1)

# perform K-Means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=1).fit(data_for_clustering)
logistics_delay_sample['Cluster'] = kmeans.labels_

# plot clusters
plt.figure(figsize=(8, 5))
sns.histplot(data=logistics_delay_sample, x='LogisticsDelay', hue='Cluster', kde=True, bins=25, palette='Set1')
plt.title('Logistics Delay in Clusters')
plt.xlabel('Logistics Delay (in days)')
plt.ylabel('Frequency')
plt.show()


# Describe the clusters
for c in sorted(logistics_delay_sample['Cluster'].unique()):
    summary = logistics_delay_sample.loc[logistics_delay_sample['Cluster'] == c, 'LogisticsDelay'].describe()
    print(f"Cluster {c}", summary)
Cluster 0 count    1281.0
mean        7.0
std         0.0
min         7.0
25%         7.0
50%         7.0
75%         7.0
max         7.0
Name: LogisticsDelay, dtype: float64
Cluster 1 count    638.0
mean       8.0
std        0.0
min        8.0
25%        8.0
50%        8.0
75%        8.0
max        8.0
Name: LogisticsDelay, dtype: float64
Cluster 2 count    838.0
mean       6.0
std        0.0
min        6.0
25%        6.0
50%        6.0
75%        6.0
max        6.0
Name: LogisticsDelay, dtype: float64
Cluster 3 count    233.000000
mean       9.253219
std        0.541629
min        9.000000
25%        9.000000
50%        9.000000
75%        9.000000
max       12.000000
Name: LogisticsDelay, dtype: float64
Cluster 4 count    75.0
mean      5.0
std       0.0
min       5.0
25%       5.0
50%       5.0
75%       5.0
max       5.0
Name: LogisticsDelay, dtype: float64
#### After performing the cluster analysis we can confirm that the distribution is multimodal. More specifically, delays cluster around:
- 5 days
- 6 days
- 7 days
- 8 days
- 9 to 12 days

This indicates that the delays aren't spread evenly across a range but are instead concentrated at specific points.


### b. Determine the mean logistics delay, considering weekends. Interpret this number and discuss possible alternatives.

[41]:
#calculating the mean of the logistics delay dataframe using the describe() method.

logistics_delay_df['LogisticsDelay'].describe()
[41]:
count    306490.000000
mean          7.080437
std           1.012302
min           4.000000
25%           6.000000
50%           7.000000
75%           8.000000
max          15.000000
Name: LogisticsDelay, dtype: float64
The mean logistics delay, considering weekends, is calculated to be 7.08 days. This number reflects the average time it takes for goods to move through the logistics process, including periods when operations might be slower or paused over the weekend. The cluster analysis shows this as well: cluster 0, in which the delay is 7 days, has the highest count of observations.

Impact of the weekend: Including weekends in the calculation of the logistics delay can artificially inflate the delay times. For instance, if goods are produced on a Friday, they might not move forward in the process until the following business day, Monday, adding several days to the logistics delay.

A possible alternative would be to implement or increase weekend operations, for example some automated processes or partial shifts. This way, the impact of non-working time during the weekend would be minimized.

Another alternative would be to ensure that goods are not produced right before the weekend, so that the time it takes for goods to move through the logistics process is not artificially inflated.
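One way to quantify the weekend effect discussed above is to recompute the delay in business days only. NumPy's `np.busday_count` counts the weekdays between two dates; the sketch below uses a hypothetical Friday-to-Wednesday example rather than the actual K7 data:

```python
import numpy as np
import pandas as pd

# Hypothetical example: goods issued on a Friday, received the following Wednesday
issued = pd.Series(pd.to_datetime(["2016-11-11"]))    # Friday
received = pd.Series(pd.to_datetime(["2016-11-16"]))  # Wednesday

# Calendar-day delay (the measure used above) counts Saturday and Sunday
calendar_delay = (received - issued).dt.days

# Business-day delay skips the weekend
business_delay = np.busday_count(
    issued.values.astype("datetime64[D]"),
    received.values.astype("datetime64[D]"),
)

print(calendar_delay.iloc[0], business_delay[0])  # 5 calendar days vs 3 business days
```

Applied to the full `logistics_delay_df`, this would give a weekend-adjusted mean that could be compared against the 7.08-day calendar mean.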

### c. Visualize the distribution appropriately by displaying the histogram and density function using "plotly". Describe how you selected the size of the bins.

[42]:
# create a histogram and overlay a density function
fig = ff.create_distplot(
    hist_data=[logistics_delay_sample['LogisticsDelay']],
    group_labels=['Logistics Delay'],
    bin_size=1, show_hist=True, show_curve=True)

# customize the layout
fig.update_layout(
    title='Distribution of Logistics Delays',
    xaxis_title='Logistics Delay (days)',
    bargap=0.2)

#show the plot
fig.show()

[Figure: "Distribution of Logistics Delays", histogram with density overlay, x-axis: Logistics Delay (days)]
The bin size for a histogram is crucial because it affects the presentation of the data. In this particular case, since the data are discrete (delays measured in whole days), a bin size of 1 day is appropriate. This way each delay day has its own bin, clearly highlighting the distribution of the delay time.
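Because the delays are whole days, a bin size of 1 makes the histogram identical to the per-day frequency table, which can be read off directly. A sketch with hypothetical delay values standing in for the sample:

```python
import pandas as pd

# Hypothetical delay values standing in for logistics_delay_sample['LogisticsDelay']
delays = pd.Series([6, 7, 7, 7, 8, 8, 9, 5, 7, 6])

# With bin_size=1 each whole-day delay gets its own bar,
# so the histogram is exactly this frequency table:
counts = delays.value_counts().sort_index()
print(counts)
```

A wider bin (e.g. 2 days) would merge adjacent peaks and hide the multimodal structure seen above.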

### d) Describe the process for creating a decision tree to classify whether the component (K7) is defective (Fehlerhaft) or not. (Hint: Use visualizations.)

### Step 1: Import Packages

Import packages that will be used for the creation of the tree: matplotlib.pyplot, sklearn, etc.

### Step 2: Data Preparation
#### Merge Datasets
Combine Komponente_K7 (the production details of K7) with Logistikverzug_K7 (the logistics delay of K7).

#### Handle Data
Examine the data, check for missing values and decide how to handle them (either by imputing the mean/median or by deleting rows with missing data).

### Step 3: Visualize Data
Plot a correlation matrix to see how the different features (production date, logistics delay, Herstellernummer, Werksnummer) interact with the defectiveness of K7. This can help select the most relevant features for the model.

In addition, a boxplot or a histogram can be used to determine the distribution of the data and get a deeper understanding.
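Steps 2 and 3 can be sketched as follows; the frame below is a hypothetical stand-in for the merged K7 data (all column values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the merged Komponente_K7 / Logistikverzug_K7 frame
df = pd.DataFrame({
    "LogisticsDelay": [6, 7, 8, 7, 9, 6],
    "Herstellernummer": [112, 112, 114, 114, 112, 114],
    "Fehlerhaft": [0, 0, 1, 0, 1, 0],
})

# Step 2: check for missing values before deciding how to handle them
missing = df.isna().sum()

# Step 3: correlation matrix of the candidate features;
# sns.heatmap(corr, annot=True) would visualize it as described above
corr = df.corr()
print(corr["Fehlerhaft"])
```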


Step 4 Define Features and Target Variable¶

Target Variable:¶

Is the componenent defective or not?

Features:¶

Determine the features that are most relevant for target variable - for example one of the primary features can be Logistics Delay, or a combination of the Logistics Delay and Herstellernummer.

Step 5: Splitting the data¶

Split the dataset into a training set and a testing set. The training set will be used to train the model and the test set to evaluate the performance of the decision tree. This split can have a balance 70% training and 30% testing.

Step 6: Building the decision tree¶

Use a decision tree classifier from the scikit-learn library. The decision tree algorithm will automatically determine the best splits in the data to classify components as defective or not.

Consider tuning hyperprarameters like: maximum depth of the tree, minimum samples per leaf.

Step 7: Evaluate the model¶

Confusion matrix: Evaluate the model using a confusion matrix to determine the number ofof true positives, false positives, true negatives, and false negatives.

Step 8: Visualize the decision tree¶

Visualize the tree, for example using plot_tree from scikit-learn.
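For instance (iris as a stand-in dataset; the figure is saved to a file so the snippet also runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Draw the tree: each node shows its split rule, sample counts and class mix
fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(clf, filled=True, ax=ax)
fig.savefig("tree.png")
```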

Step 9: Final model interpretation¶

Interpret the tree's structure.

2. Data Storage in Separate Files¶

Explain why it makes sense to store the available data in separate files instead of saving everything in one large table. Name at least four benefits. The available tables represent a typical database structure. What is this structure called?¶

Data security:¶

When saved in separate files, sensitive data is better protected. For example, in the case of data corruption or loss, the impact is limited to the affected files rather than the whole dataset.

Enhanced performance¶

Large tables can lead to slower performance and longer processing times. With separate files, queries can be executed faster and more efficiently, as exposure to unnecessary data is limited.

Data Integrity¶

When data is stored in smaller separate files, the complexity of managing and processing it is reduced. This leads to fewer errors, which helps maintain accuracy and consistency.

Scalability and ease of integration¶

As the data grows, separate files allow the system to scale more efficiently. New files can be added, or older ones removed, without modifying the whole system. Similarly, separate files are more easily integrated into different systems.

Easier maintenance and debugging¶

If there is an issue with the data, identifying and fixing the problem is easier when the data is organized into separate files. In addition, tasks such as cleaning or updating can be performed more quickly, as one focuses on specific files instead of the whole dataset.

The name of such a database structure is: Relational Database Structure¶

Task 3 : Determine how many parts T16 ended up in vehicles registered in Adelshofen.¶

Extract the unique column names (without .x or .y)¶
[43]:
base_columns = set(col.split('.')[0] for col in Einzelteil.columns)
Initialize an empty DataFrame to store the combined columns¶
[44]:
combined_df = pd.DataFrame()
[45]:
# Loop through each unique base column
for col in base_columns:
    # Combine the base column with its .x and .y suffixed variants
    if f"{col}.x" in Einzelteil.columns and f"{col}.y" in Einzelteil.columns:
        combined_df[col] = Einzelteil[col].combine_first(Einzelteil[f"{col}.x"]).combine_first(Einzelteil[f"{col}.y"])
    elif f"{col}.x" in Einzelteil.columns:
        combined_df[col] = Einzelteil[col].combine_first(Einzelteil[f"{col}.x"])
    elif f"{col}.y" in Einzelteil.columns:
        combined_df[col] = Einzelteil[col].combine_first(Einzelteil[f"{col}.y"])
    else:
        combined_df[col] = Einzelteil[col]

[46]:
combined_df.head()
[46]:
Herstellernummer Fehlerhaft_Fahrleistung Fehlerhaft_Datum Produktionsdatum Werksnummer ID_T16 X1 Fehlerhaft
1 212.0 0.0 NaN 2008-11-07 2121.0 16-212-2121-7 1 0.0
2 212.0 0.0 NaN 2008-11-08 2122.0 16-212-2122-41 2 0.0
3 212.0 0.0 NaN 2008-11-07 2121.0 16-212-2121-36 5 0.0
4 212.0 0.0 NaN 2008-11-07 2122.0 16-212-2122-20 10 0.0
5 212.0 0.0 NaN 2008-11-07 2122.0 16-212-2122-33 12 0.0
[47]:
combined_df.nunique()
[47]:
Herstellernummer                3
Fehlerhaft_Fahrleistung       426
Fehlerhaft_Datum             2899
Produktionsdatum             2914
Werksnummer                     5
ID_T16                     818844
X1                         818844
Fehlerhaft                      2
dtype: int64
Merge all the files¶
[48]:
# Step 1: Join combined_df and Komponente_df on ID_T16
merged_df1 = pd.merge(combined_df, Komponente_df, on='ID_T16', how='inner')

# Step 2: Join the result with Fahrzeuge_Merged_df on ID_Komponente
merged_df2 = pd.merge(merged_df1, Fahrzeuge_Merged_df, on='ID_Komponente', how='inner')

# Step 3: Join the result with zulassungen_df on IDNummer
final_merged_df = pd.merge(merged_df2, zulassungen_df, on='IDNummer', how='inner')

# Display the first few rows of the final merged DataFrame
final_merged_df.head()
[48]:
Herstellernummer Fehlerhaft_Fahrleistung Fehlerhaft_Datum Produktionsdatum Werksnummer ID_T16 X1 Fehlerhaft ID_Komponente IDNummer Unnamed: 0 Gemeinden Zulassung
0 212.0 0.0 NaN 2008-11-07 2121.0 16-212-2121-7 1 0.0 K2ST2-109-1092-2 22-2-21-1 2897615 RIESA 02-01-2009
1 212.0 0.0 NaN 2008-11-08 2122.0 16-212-2122-41 2 0.0 K2ST2-109-1092-59 22-2-21-2 2897616 GROEDITZ 02-01-2009
2 212.0 0.0 NaN 2008-11-07 2121.0 16-212-2121-36 5 0.0 K2ST2-109-1092-7 22-2-21-5 2897619 NAUNHOF 02-01-2009
3 212.0 0.0 NaN 2008-11-07 2122.0 16-212-2122-20 10 0.0 K2ST2-109-1092-41 22-2-21-10 2897624 BEESKOW 02-01-2009
4 212.0 0.0 NaN 2008-11-07 2122.0 16-212-2122-33 12 0.0 K2ST2-109-1092-67 22-2-21-12 2897626 EISENHUETTENSTADT 02-01-2009
Filter the DataFrame where Gemeinden is 'Adelshofen'¶
[49]:

filtered_df = final_merged_df[final_merged_df['Gemeinden'] == 'ADELSHOFEN']


Count the number of unique ID_T16 values¶
[50]:

unique_id_count = filtered_df['ID_T16'].nunique()
[51]:
print(f"Number of unique ID_T16 where Gemeinden = 'Adelshofen': {unique_id_count}")
Number of unique ID_T16 where Gemeinden = 'Adelshofen': 8
Task 4 : Identify the data types of the attributes in the registration table “Zulassungen_aller_Fahrzeuge.” Present your answers in a table integrated into your Markdown document and describe the characteristics of the data types.¶

[52]:
# Checking data types of the attributes in Zulassungen_aller_Fahrzeuge
data_types = zulassungen_df.dtypes
[53]:
# Presenting data types in a table format
data_types_table = pd.DataFrame(data_types, columns=['Data Type'])
print(data_types_table)
           Data Type
Unnamed: 0     int64
IDNummer      object
Gemeinden     object
Zulassung     object
[54]:
# Rename the Unnamed: 0 column
zulassungen_df.rename(columns={'Unnamed: 0': 'Index'}, inplace=True)
[55]:
# Convert Zulassung to datetime
zulassungen_df['Zulassung'] = pd.to_datetime(zulassungen_df['Zulassung'], format='%d-%m-%Y')
[56]:
# Check the unique values and count of the "Gemeinden" column
unique_gemeinden = zulassungen_df['Gemeinden'].nunique()
total_gemeinden = zulassungen_df['Gemeinden'].count()

print(f"Total entries in 'Gemeinden': {total_gemeinden}")
print(f"Unique values in 'Gemeinden': {unique_gemeinden}")
Total entries in 'Gemeinden': 1048575
Unique values in 'Gemeinden': 5752
[57]:
# If unique values are significantly less than total, converting to category is efficient
if unique_gemeinden < total_gemeinden / 2:  # arbitrary threshold; adjust as needed
    zulassungen_df['Gemeinden'] = zulassungen_df['Gemeinden'].astype('category')
    print("Converted 'Gemeinden' column to category type.")
else:
    zulassungen_df['Gemeinden'] = zulassungen_df['Gemeinden'].astype('str')
    print("Converted 'Gemeinden' column to string type.")
Converted 'Gemeinden' column to category type.
[58]:
# Check the updated data types
print("Data types after converting 'Gemeinden':")
print(zulassungen_df.dtypes)
Data types after converting 'Gemeinden':
Index                 int64
IDNummer             object
Gemeinden          category
Zulassung    datetime64[ns]
dtype: object
[59]:
# Update the data types table after conversion
data_types_table = pd.DataFrame(zulassungen_df.dtypes, columns=['Data Type'])
[60]:
# Normalize the Data Type values to lowercase
data_types_table['Data Type'] = data_types_table['Data Type'].astype(str).str.lower()
[61]:
# Describe the characteristics of each data type
characteristics = {
    'int64': 'Integer data type, used for numeric data.',
    'float64': 'Floating point data type, used for numeric data with decimals.',
    'object': 'Object data type, often used for text data or mixed data types.',
    'datetime64[ns]': 'Datetime data type, used for date and time information.',
    'category': 'Category data type, used for categorical data to save memory.'
}
[62]:
# Adding characteristics to the table using .get() with a default value
data_types_table['Characteristics'] = data_types_table['Data Type'].apply(lambda x: characteristics.get(x, 'Unknown data type'))
[63]:
print(data_types_table)
                Data Type                                    Characteristics
Index               int64          Integer data type, used for numeric data.
IDNummer           object  Object data type, often used for text data or ...
Gemeinden        category  Category data type, used for categorical data ...
Zulassung  datetime64[ns]  Datetime data type, used for date and time inf...
[64]:
# Convert the table to markdown format
from tabulate import tabulate
markdown_table = tabulate(data_types_table, headers='keys', tablefmt='pipe')

[65]:
# Print the markdown table
print("\nMarkdown Table:\n")
print(markdown_table)
Markdown Table:

|           | Data Type      | Characteristics                                                 |
|:----------|:---------------|:----------------------------------------------------------------|
| Index     | int64          | Integer data type, used for numeric data.                       |
| IDNummer  | object         | Object data type, often used for text data or mixed data types. |
| Gemeinden | category       | Category data type, used for categorical data to save memory.   |
| Zulassung | datetime64[ns] | Datetime data type, used for date and time information.         |
Data Types in "Zulassungen_aller_Fahrzeuge" ¶

Data Types and Characteristics¶
| Attribute | Data Type | Characteristics |
|-----------|------------------|-----------------------------------------------------------------|
| Index | `int64` | Integer data type, used for numeric data. |
| IDNummer | `object` | Object data type, often used for text data or mixed data types. |
| Gemeinden | `category` | Category data type, used for categorical data to save memory. |
| Zulassung | `datetime64[ns]` | Datetime data type, used for date and time information. |
Task 5 : Create a linear model from the table “Fahrzeuge_OEM1_Typ11_Fehleranalyse” relating mileage to suitable variables. Derive recommendations for OEM1 based on this model.¶

[66]:
# Dropping the first two columns as they contain the same values
Fahrzeuge_df = Fahrzeuge_df.drop(columns=['Unnamed: 0', 'X1'])


[67]:
# Display the updated dataframe structure
Fahrzeuge_df.head()
[67]:
X ID_Fahrzeug Herstellernummer Werksnummer Fehlerhaft_Datum Fehlerhaft_Fahrleistung days fuel engine
0 9 11-1-11-9 1 11 2010-03-16 34824.319559 1493.150761 4.003670 small
1 11 11-1-11-11 1 11 2010-03-16 74217.428309 1044.462231 11.042487 large
2 13 11-1-11-13 1 11 2010-03-16 32230.699639 749.669810 3.579117 small
3 15 11-1-11-15 1 11 2010-03-16 44885.783551 858.688003 4.666801 small
4 37 11-1-11-37 1 11 2010-03-17 86348.329866 1478.204174 4.634381 small
[68]:
print(Fahrzeuge_df.dtypes)

X                            int64
ID_Fahrzeug                 object
Herstellernummer             int64
Werksnummer                  int64
Fehlerhaft_Datum            object
Fehlerhaft_Fahrleistung    float64
days                       float64
fuel                       float64
engine                      object
dtype: object
[69]:
# Convert categorical variables into dummy/indicator variables
Fahrzeuge_df = pd.get_dummies(Fahrzeuge_df, columns=['Herstellernummer', 'Werksnummer', 'engine'], drop_first=True)

[70]:
Fahrzeuge_df.head()
[70]:
X ID_Fahrzeug Fehlerhaft_Datum Fehlerhaft_Fahrleistung days fuel Werksnummer_12 engine_medium engine_small
0 9 11-1-11-9 2010-03-16 34824.319559 1493.150761 4.003670 False False True
1 11 11-1-11-11 2010-03-16 74217.428309 1044.462231 11.042487 False False False
2 13 11-1-11-13 2010-03-16 32230.699639 749.669810 3.579117 False False True
3 15 11-1-11-15 2010-03-16 44885.783551 858.688003 4.666801 False False True
4 37 11-1-11-37 2010-03-17 86348.329866 1478.204174 4.634381 False False True
[71]:
print(Fahrzeuge_df.dtypes)
X                            int64
ID_Fahrzeug                 object
Fehlerhaft_Datum            object
Fehlerhaft_Fahrleistung    float64
days                       float64
fuel                       float64
Werksnummer_12                bool
engine_medium                 bool
engine_small                  bool
dtype: object
[72]:
# Convert Zulassung to datetime
Fahrzeuge_df['Fehlerhaft_Datum'] = pd.to_datetime(Fahrzeuge_df['Fehlerhaft_Datum'], format='mixed')
[73]:
# Define the feature variables (X) and target variable (y)
X = Fahrzeuge_df.drop(columns=['Fehlerhaft_Fahrleistung', 'Fehlerhaft_Datum', 'Werksnummer_12', 'days', 'ID_Fahrzeug'])
y = Fahrzeuge_df['Fehlerhaft_Fahrleistung']
[74]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[75]:
# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
[75]:
LinearRegression()
[76]:
# Predict the mileage on the test set
y_pred = model.predict(X_test)
[77]:
# Calculate the model's performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
[78]:
# Output the coefficients, MSE, and R-squared value
coefficients = pd.Series(model.coef_, index=X.columns)
model_summary = {
    "Mean Squared Error": mse,
    "R-squared": r2,
    "Coefficients": coefficients
}

model_summary
[78]:
{'Mean Squared Error': 123515188.0775618,
 'R-squared': 0.4900147838613257,
 'Coefficients': X                   -0.000054
 fuel              5213.628779
 engine_medium     7705.514824
 engine_small     10954.392346
 dtype: float64}
[79]:
# Calculate predictions for the training set to plot the model's fit
y_train_pred = model.predict(X_train)

# Scatter plot of actual vs predicted mileage (Training set)
plt.figure(figsize=(10, 6))
plt.scatter(y_train, y_train_pred, color='blue', alpha=0.5)
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=3)
plt.xlabel('Actual Mileage (Fehlerhaft_Fahrleistung)')
plt.ylabel('Predicted Mileage (Fehlerhaft_Fahrleistung)')
plt.title('Actual vs Predicted Mileage (Training Set)')
plt.grid(True)
plt.show()

Task 6 : On 11.08.2010, there was a hit-and-run accident. The license plate of the car involved is unknown. The police have asked for your assistance, as you work for the Federal Motor Transport Authority, to find out where the vehicle with body part number “K5-112-1122-79” was registered.¶

[80]:
# Find the vehicle ID corresponding to the body part number "K5-112-1122-79"
vehicle_info = Bestandteile_Fahrzeuge_OEM1_Typ12_df[Bestandteile_Fahrzeuge_OEM1_Typ12_df['ID_Karosserie'] == 'K5-112-1122-79']
[81]:
# Extract the ID_Fahrzeug
vehicle_id = vehicle_info['ID_Fahrzeug'].values[0] if not vehicle_info.empty else None
vehicle_id
[81]:
'12-1-12-82'
[82]:
# Find the registration details using the vehicle ID
registration_info = zulassungen_df[zulassungen_df['IDNummer'] == vehicle_id]
[83]:
# Extract the relevant registration details
registration_place = registration_info['Gemeinden'].values[0] if not registration_info.empty else None
registration_place
[83]:
'ASCHERSLEBEN'
The vehicle with the body part number "K5-112-1122-79" was registered in Aschersleben¶
